1. Introduction

What is Ecobici?

Ecobici is a government-sponsored program that encourages people to use bicycles as a means of transportation in Mexico City. Users of the program pay an annual fee to borrow a bike at any time for a maximum duration of 45 minutes. There are bike stations in different parts of the city, and one can take a bike at any station and return it at any other.

Why analyze Ecobici?

While the program has been generally well received since it was launched in 2010, there is still a lot that can be done to further increase its adoption. To create a successful expansion plan, however, one must first gather knowledge about the factors that seem to influence adoption. One way to do that is by understanding the current state of adoption faceted by different traits, such as gender, age, and location.

For this project, we’ll attempt to answer the following questions:

  1. Who uses Ecobici more: men or women?
  2. Does age play a role in the adoption of Ecobici?
  3. Who are the most devoted users of Ecobici by region?
  4. What is the rate of adoption of the program throughout the years?
  5. What’s the relationship between age and the number of rides a person takes?
  6. What’s the relationship between tenure and the number of rides a person takes?
  7. How active are users of Ecobici in terms of rides per week?

Surely other questions will pop up as we do the analysis, and we’ll try to answer those as well.

2. The Dataset

The dataset we’ll be analyzing in this project is that of users of the Ecobici program who signed up between February 15, 2010 and December 31, 2013. This and other datasets related to Ecobici can be downloaded from Mexico City’s Data Labs official website.

The columns of the dataset include:

  1. User ID
  2. Card ID
  3. Gender
  4. Date of birth
  5. Borough (where the user lives)
  6. Municipality (where the user lives)
  7. State (where the user lives)
  8. Type of registration (web, offline, telmex)
  9. Date of registration
  10. Rides count
  11. Status (active vs inactive)

(Later on when we prepare the dataset, you’ll notice that, in the registration.type variable, there’s one value called “ALTA TELMEX”. Telmex is the largest phone company in Mexico, and has partnered with Mexico City’s government to allow payment of the annual fee for Ecobici through a user’s phone bill.)

We’ll also explore some variables that are not included in the previous list, but which can be easily derived from the original dataset:

  1. Age
  2. Tenure (number of days since the user registered in the program)

Dataset Preparation

Before we begin, we’ll go through a couple of steps to put the dataset in a form more suitable for analysis.

We’ll read in the data from the ecobici_usuarios.csv file, specifying that we don’t want to read strings as factors. This is because we don’t want dates to be turned into factors. We also want to strip whitespace so as to avoid additional spurious factor levels later on.

## [1] "Spanish_Spain.1252"
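As a minimal sketch of the read step described above: the file name comes from the text, but the two-row sample written to a temporary file (including the padded " M " value) is made up for illustration so that the snippet runs on its own.

```r
# Illustrative two-row stand-in for ecobici_usuarios.csv. The padding around
# "M" is invented to show what strip.white does.
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "USUARIO,SEXO,FECHA DE INSCRIPCION,USOS",
  "145, M ,2010-02-16,0",
  "669,M,2010-02-18,0"
), sample_csv)

# stringsAsFactors = FALSE keeps dates (and other text) as character vectors.
# strip.white = TRUE trims padding, so " M " and "M" do not become two values.
users_raw <- read.csv(sample_csv,
                      stringsAsFactors = FALSE,
                      strip.white = TRUE)

str(users_raw)
```

Note that read.csv also turns the spaces in the header into dots, which is why the column names shown by str look like FECHA.DE.INSCRIPCION.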

Our dataset contains 111935 entries of 11 variables, and by running the str command, we can see that all columns are named in Spanish:

## 'data.frame':    111935 obs. of  11 variables:
##  $ USUARIO             : int  145 669 856 865 26538 956 28702 901 990 980 ...
##  $ TARJETA             : chr  "2938835630" "2938833614" "2861480206" "2935342734" ...
##  $ SEXO                : chr  "M" "M" "M" "M" ...
##  $ FECHA.DE.NACIMIENTO : chr  "1964-11-19" "1978-08-14" "1979-04-01" "1978-09-07" ...
##  $ COLONIA             : chr  "El Prado" "Zona Escolar" "Condesa" "Cuauhtémoc" ...
##  $ DELEGACION          : chr  "Coyoacán" "Gustavo A. Madero" "Cuauhtémoc" "Cuauhtémoc" ...
##  $ ESTADO              : chr  "D.F." "D.F." "D.F." "D.F." ...
##  $ MEDIO.DE.INSCRIPCION: chr  "ALTA WEB" "ALTA WEB" "ALTA" "ALTA WEB" ...
##  $ FECHA.DE.INSCRIPCION: chr  "2010-02-16" "2010-02-18" "2010-02-19" "2010-02-19" ...
##  $ USOS                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ STATUS              : chr  "Vigente" "Vigente" "Vigente" "Vigente" ...

So we’ll translate them to English to make it easier for everyone to understand what each variable stands for:

##  [1] "user.id"           "card.id"           "gender"           
##  [4] "birthday"          "borough"           "municipality"     
##  [7] "state"             "registration.type" "registration.date"
## [10] "rides.count"       "status"
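The renaming itself can be sketched as a positional assignment to names(). Both name vectors are taken from the outputs above; the one-row placeholder data frame is only there to make the snippet self-contained.

```r
spanish <- c("USUARIO", "TARJETA", "SEXO", "FECHA.DE.NACIMIENTO", "COLONIA",
             "DELEGACION", "ESTADO", "MEDIO.DE.INSCRIPCION",
             "FECHA.DE.INSCRIPCION", "USOS", "STATUS")
english <- c("user.id", "card.id", "gender", "birthday", "borough",
             "municipality", "state", "registration.type",
             "registration.date", "rides.count", "status")

# One-row placeholder carrying the Spanish headers (values are irrelevant).
users <- data.frame(matrix(NA, nrow = 1, ncol = 11))
names(users) <- spanish

# Positional rename: the i-th English name replaces the i-th Spanish one.
names(users) <- english
names(users)
```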

Some of our categorical variables (registration.type, status) also have names in Spanish, or are not totally clear (gender). We’ll translate those as well.

## [1] "ALTA WEB" "ALTA WEB" "ALTA"     "ALTA WEB" "ALTA"     "ALTA"
## [1] "Vigente"  "Vigente"  "Vigente"  "Vigente"  "Inactivo" "Vigente"
## [1] "M" "M" "M" "M" "M" "F"

After a little bit of cleaning and parsing (see the accompanying Rmd file for details), we get the following structure:

## 'data.frame':    110922 obs. of  11 variables:
##  $ user.id          : int  145 669 856 865 956 28702 901 990 980 28269 ...
##  $ card.id          : Factor w/ 110050 levels "029170de","03909c2c",..: 78910 78894 57399 73629 69873 65478 71733 71608 79109 53457 ...
##  $ gender           : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 2 2 1 ...
##  $ birthday         : POSIXct, format: "1964-11-19" "1978-08-14" ...
##  $ borough          : Factor w/ 5502 levels "1 ampl presidentes",..: 1425 5497 1063 1158 4807 334 3522 4072 2083 4351 ...
##  $ municipality     : Factor w/ 222 levels "acambay","acolman",..: 44 67 48 48 108 108 22 48 48 22 ...
##  $ state            : Factor w/ 30 levels "aguascalientes",..: 7 7 7 7 7 7 7 7 7 9 ...
##  $ registration.type: Factor w/ 3 levels "normal","telmex",..: 3 3 1 3 1 1 1 3 1 1 ...
##  $ registration.date: POSIXct, format: "2010-02-16" "2010-02-18" ...
##  $ rides.count      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ status           : Factor w/ 2 levels "inactive","active": 2 2 2 2 2 1 1 2 2 1 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1013] 5 15 910 1419 2395 2437 2600 2652 2922 3036 ...
##   .. ..- attr(*, "names")= chr [1:1013] "5" "15" "910" "1419" ...

And the values of our categorical variables are now clearer:

##   registration.type   status gender
## 1               web   active   male
## 2               web   active   male
## 3            normal   active   male
## 4               web   active   male
## 6            normal   active female
## 7            normal inactive   male
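One way to do this kind of translation is with factor(), mapping each Spanish value to an English label. This is a sketch rather than the exact code in the Rmd file; the mapping is inferred from the before and after outputs shown above, and the short input vectors are illustrative.

```r
registration <- c("ALTA WEB", "ALTA WEB", "ALTA", "ALTA TELMEX")
status       <- c("Vigente", "Vigente", "Inactivo", "Vigente")
gender       <- c("M", "M", "M", "F")

recoded <- data.frame(
  # levels = values found in the raw data, labels = their replacements
  registration.type = factor(registration,
                             levels = c("ALTA", "ALTA TELMEX", "ALTA WEB"),
                             labels = c("normal", "telmex", "web")),
  status = factor(status,
                  levels = c("Inactivo", "Vigente"),
                  labels = c("inactive", "active")),
  gender = factor(gender,
                  levels = c("F", "M"),
                  labels = c("female", "male"))
)
recoded
```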

Now we’ll extract some extra variables. We’ll begin with tenure (the number of days since the user registered):

## 'data.frame':    110922 obs. of  1 variable:
##  $ tenure: num  1414 1412 1411 1411 1411 ...
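A minimal sketch of the tenure computation, assuming the reference date is December 31, 2013 (the last registration date in the dataset). The three registration dates are the first ones shown in the structure above; the variable names are illustrative.

```r
reference_date    <- as.Date("2013-12-31")
registration.date <- as.Date(c("2010-02-16", "2010-02-18", "2010-02-19"))

# Subtracting two Date vectors gives a difference in days.
tenure <- as.numeric(reference_date - registration.date)
tenure  # 1414 1412 1411, the first values listed above
```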

Then we’ll add age:

## 'data.frame':    110922 obs. of  1 variable:
##  $ age: num  49 35 34 35 39 3 3 40 62 3 ...

As is usual with datasets in the real world, there’s some bad data that we need to filter out. For our dataset, it turns out that there are 26 people who were born in the future or were riding an adult bike at age 3 or less!

## [1] 26 13
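Age and the bad-data check can be sketched in the same way. We assume the same reference date (December 31, 2013) and approximate a year as 365.25 days; the last two birthdays below are made up to stand in for the invalid rows.

```r
reference_date <- as.Date("2013-12-31")
birthday <- as.Date(c("1964-11-19", "1978-08-14",   # from the dataset
                      "2010-06-01", "2014-05-01"))  # made-up invalid rows

# Age in whole years, approximating a year as 365.25 days.
age <- floor(as.numeric(reference_date - birthday) / 365.25)

# Impossible ages: born after the reference date, or aged 3 or less.
bad <- age <= 3
sum(bad)

# Keep only the plausible rows.
valid_age <- age[!bad]
valid_age
```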

So, let’s remove those and get ready for our exploratory data analysis. Let’s get started!

3. Exploration

Bird’s-eye view of the dataset

To get a bird’s-eye overview of our dataset, let’s create a scatterplot matrix:

Some of the resulting plots are very easy to interpret (e.g. histograms of categorical variables), while others, such as the histogram and density plots of rides.count, require further investigation to understand completely. In the following sections, we’ll try to understand such plots better.


Univariate Analysis and Plots

Let’s begin by taking a look at a summary of the variables of interest, one by one, in order to avoid the crammed output that summary(users) would give us:

Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   28.00   33.00   35.29   41.00   83.00

Tenure

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   191.0   386.0   544.9  1015.0  1415.0

Number of rides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    6.00   34.92   38.00 1254.00

Gender

## female   male 
##  42311  68585

Status

## inactive   active 
##    13216    97680

State, Municipality and Borough

##        state                  municipality                borough     
##  d.f.     :95822   cuauhtémoc       :35743   roma norte       : 6124  
##  edo. mex.:14843   miguel hidalgo   :19675   condesa          : 5203  
##  hidalgo  :   28   benito juárez    : 8857   cuauhtémoc       : 4617  
##  morelos  :   26   gustavo a. madero: 5800   juárez           : 2731  
##  querétaro:   21   álvaro obregón   : 4916   hipódromo        : 2557  
##  jalisco  :   20   coyoacán         : 3930   polanco v sección: 2444  
##  (Other)  :  136   (Other)          :31975   (Other)          :87220

Type of Registration

## normal telmex    web 
## 105158    977   4761

Birthday and Date of Registration

##     birthday                   registration.date            
##  Min.   :1930-05-30 00:00:00   Min.   :2010-02-15 00:00:00  
##  1st Qu.:1972-08-17 00:00:00   1st Qu.:2011-03-22 00:00:00  
##  Median :1980-08-12 00:00:00   Median :2012-12-10 00:00:00  
##  Mean   :1978-03-18 01:14:23   Mean   :2012-07-04 01:26:28  
##  3rd Qu.:1985-09-09 00:00:00   3rd Qu.:2013-06-23 00:00:00  
##  Max.   :2000-04-05 00:00:00   Max.   :2013-12-31 00:00:00

Some facts we can see right away include:

  • The number of registered female users (42311) versus male users (68585)

  • The oldest birthday (May 30 1930)

  • The states with most users: D.F. (Federal District) with 95822 users, followed by Edo. Mex. (State of Mexico) with 14843 users

  • The municipalities with most users: Cuauhtémoc with 35743 users and Miguel Hidalgo with 19675 users (which comes as no surprise, given that most bike stations are in those areas.)

  • The range of ages goes from 13 to 83 years old

  • The number of active users (97680) vs the number of inactive users (13216)

We’ll explore these facts and others in much more detail in the following sections.


Age

One of the original questions posed at the beginning of this project was: Does age play a role in the adoption of Ecobici? The following histogram may give us some insight into the answer:

A couple of things are immediately obvious:

  • There are no gaps in the range of ages,
  • The distribution is unimodal and skewed to the right (the longer tail is on the side of older ages),
  • The mode appears to be around 30, and
  • The outline of the histogram is kind of smooth (i.e. the function that relates adoption of the program and age of the users seems to be smooth.)

The skewness of the distribution is most likely due to the fact that Ecobici requires a credit card or phone bill invoice for registration. Underage people can still register, but they need to get written approval from their parents.

Let’s add some more detail to the graph to investigate the mean and median, as well as the interquartile range (IQR):

The median seems to be around 33, with the mean at around 35, a consequence of the longer right tail of the distribution. The mode, which represents the largest age group in the program, is right at 28 years old.


Age faceted by Gender

Let’s now compare the female and male populations in a frequency polygon plot:

It’s clear that the distributions are somewhat similar in shape but men dominate in number. To see this more clearly, let’s draw them in different plots in the same column, along with their mean, median and IQR:

Now it’s easier to see that the male population tends to be a bit older than the female population. Also, it seems like the variance of the male population is greater. Let’s compute the standard deviation to confirm or deny this perception:

## users$gender: female
## [1] 9.747954
## -------------------------------------------------------- 
## users$gender: male
## [1] 10.53073

And, indeed, the male population has a greater variance.
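The grouped standard deviation above has the shape of a by() call. A small sketch on made-up ages (not the real data) shows the pattern:

```r
# Made-up ages, three per gender, just to show the call.
users <- data.frame(
  gender = factor(c("female", "female", "female", "male", "male", "male")),
  age    = c(25, 30, 35, 20, 35, 50)
)

# Apply sd() to the ages of each gender separately.
age_sd <- by(users$age, users$gender, sd)
age_sd
```

In this toy sample the male ages are more spread out, so their standard deviation is the larger of the two.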

Finally, let’s create a boxplot that includes the very first outliers to get another look at the same data:

One additional piece of information that the boxplot gives us is the age at which men or women start being considered “rare” (the “outlier thresholds”) in their respective populations. Notice that in the case of men, we see the first outliers after age 63, while in the case of women it is after age 57. This supports our hypothesis that the male population tends to be older.
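As a rough illustration of where such thresholds come from, assuming the conventional rule that boxplot whiskers extend at most 1.5 times the IQR beyond the quartiles, we can compute the upper fence for the whole population from the age summary shown earlier. The per-gender fences use each group's own quartiles, which are not listed in this report.

```r
# Quartiles of age for all users, taken from the summary in the
# univariate section.
q1 <- 28
q3 <- 41

iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
upper_fence  # 60.5
```

The result falls between the thresholds read off the boxplot for women (57) and men (63), as one would expect for the combined population.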


Age faceted by Type of Registration

Let’s now explore the distribution partitioned by type of registration:

It’s obvious from this plot that an overwhelming majority of users sign up using the traditional means (as opposed to signing up on the web or using Telmex), but also that the “web” population dominates over the “telmex” population, albeit by a small margin (at least as compared to how the “normal” population dominates over the two).

Let’s now compare the three populations on a free scale with a more elaborate histogram:

Clearly, the Telmex population seems to be much younger, though it’s unclear why this might be. One possible explanation is that younger people are less likely to have a credit card, so they may be using their phone bill (or their parents’) to join the program.

Running a summary by type of registration, we confirm that the Telmex population does tend to be younger:

## users$registration.type: normal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   28.00   33.00   35.27   41.00   83.00 
## -------------------------------------------------------- 
## users$registration.type: telmex
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   24.00   29.00   32.35   39.00   75.00 
## -------------------------------------------------------- 
## users$registration.type: web
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   29.00   34.00   36.43   42.00   79.00

It also seems that the Telmex population has greater variance. Computing the standard deviation we can see whether this is true:

## users$registration.type: normal
## [1] 10.28671
## -------------------------------------------------------- 
## users$registration.type: telmex
## [1] 11.24539
## -------------------------------------------------------- 
## users$registration.type: web
## [1] 9.996479

Age faceted by Type of Registration and Gender

Finally, let’s make a plot where we can see the distribution of age by gender and type of registration to see whether women or men have a preference in the way they sign up to the program:

There appears to be no significant difference in the distributions of men and women in either the normal or web populations, but it could be argued that there’s something going on in the case of Telmex where the variations in the distributions of men and women are less in sync than in the other populations. This is an intriguing finding, but we have no further data to make a hypothesis about it.


Rides Count

Let’s now proceed to analyze the number of rides for users:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    6.00   34.92   38.00 1254.00

We seem to have a heavily skewed distribution with a very long, thin right tail: a small number of users have each taken a very large number of rides. We can also see this if we tally the frequency of each ride count and look at the upper end of the tally:

## 
##  763  767  770  772  781  789  801  817  822  825  838  872  878  888  909 
##    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
##  915  944  946 1077 1254 
##    1    1    1    1    1
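A tally like this one can be sketched with table() and tail(). The ride counts below are a made-up sample, not the real data.

```r
rides.count <- c(0, 0, 0, 1, 1, 2, 6, 38, 944, 1254)  # made-up sample

# table() counts how many users share each ride count, sorted by ride count.
tally <- table(rides.count)

# The last entries are the heaviest riders; head(tally, n) gives the other end.
top <- tail(tally, 3)
top
```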

To get a better visualization, we’ll apply a scale transformation on the x axis:

There’s an important and very evident finding here: a rather large number of people signed up for the program, used it a couple of times and then stopped using it. Let’s see the exact numbers:

## 
##     0     1     2     3     4     5     6     7     8     9 
## 40280  3857  3403  2809  2444  2195  1866  1754  1587  1502

There is a very large number of people with no rides. Could this be bad data, or is it true that about 36% of the users signed up for the program but then never actually used it at all?

Let’s add more information to the previous plot:

The median number of rides appears to be 6. Let’s confirm the numbers analytically:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    6.00   34.92   38.00 1254.00

This has really caught our attention. Let’s see what proportion of the total number of users have taken no more than the median number of rides:

## [1] 0.5126785
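We can cross-check this figure by hand from numbers already shown: the tally of users with 0 to 6 rides and the total number of users (female plus male). The variable names are illustrative; on the full data frame the same quantity would be something like mean(users$rides.count <= median(users$rides.count)).

```r
# Users with 0, 1, ..., 6 rides, copied from the tally above.
counts_0_to_6 <- c(40280, 3857, 3403, 2809, 2444, 2195, 1866)

# Total users: 42311 women + 68585 men, from the gender summary.
total_users <- 42311 + 68585

# Share of users with six rides or fewer (the median is 6).
prop_at_most_median <- sum(counts_0_to_6) / total_users
round(prop_at_most_median, 7)  # 0.5126785, matching the output above

# Share of users who never took a single ride.
prop_zero <- counts_0_to_6[1] / total_users
round(prop_zero, 3)
```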

This number is quite striking: around 51% of users have tried the program only a handful of times (six rides or fewer)! Certainly not what we might call “active users”. This is an important finding.

Now, let’s analyze whether this phenomenon appears regardless of gender:

In the case of women, the phenomenon seems to be much worse with a median of 3 rides, which we can confirm analytically:

## users$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    3.00   25.38   26.00  944.00 
## -------------------------------------------------------- 
## users$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    8.00   40.81   48.00 1254.00
## users$gender: female
## [1] 26
## -------------------------------------------------------- 
## users$gender: male
## [1] 48

These last numbers show something surprising: women have a median of 3 rides, while men have a median of 8 rides. The means are higher because of outliers in both groups that drag them up to about 40.81 in the case of men and 25.38 in the case of women.

Let’s see the proportion of women (relative only to the female population) who have taken a number of rides equal to or less than the median:

## [1] 0.5013117

Despite the lower median, the proportion of “active users” (informally defined here as those who have taken more than the median number of trips in their population) for women is actually quite close to that of the general population.

However, if we just concentrate on the values below the median, and we look at the proportions, rather than the full counts, it’s immediately obvious that the phenomenon is a bit more pronounced in the case of women:

From this last plot, there’s no doubt that the most devoted “active” users of Ecobici are men (people with over 1,000 rides). Also interesting is that, despite the downward trend after about 25 rides for women, they make an important comeback when getting near 1,000 rides, with a proportion of around 0.30.

Finally, let’s see what additional information about outliers we can glean from a boxplot:

Well, it seems that women with over 150 rides are quite rare as far as their population goes, while men need to surpass 300 rides to be considered rare.


Tenure

We’ll now take a look at tenure, the number of days a user has been in the program:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   191.0   386.0   544.9  1015.0  1415.0

The distribution shown by the previous plot tells us at least three things:

  1. Most people using Ecobici joined relatively recently (within the last year and a half or so).

  2. The second biggest group of users seems to correspond to the early adopters back in 2010 when Ecobici was first launched.

  3. Something happened after the second year of operation that caused a diminished number of new users. Maybe there was just not enough advertising or only a limited number of bike stations was available, thus limiting the number of potential users. However, that clearly changed during the third year of its operation.

Let’s change the binwidth of the histogram to 1 to see if we notice anything unusual:

The overall shape of the distribution seems the same, but we do see two or three very distinct points with well over 400 counts. We’re not sure what these might represent, but they don’t seem to warrant much attention.

Another comparison we can make is whether the Federal District and the State of Mexico have the same tenure distributions. Let’s find out:

The overall shapes of the distributions are quite similar, but the State of Mexico does have a lower median and mean, and generally fewer people with tenures between 300 and 500 days. This means that, for some reason, the number of new Ecobici users coming from the State of Mexico declined much more rapidly than its counterpart in the Federal District after the program’s first year of operation.

Finally, let’s see whether faceting by gender shows anything interesting:

The distributions are really similar in shape, so the answer appears to be no.


Location (State, Municipality and Borough)

Our dataset contains three variables related to location. We’re interested in seeing which locations have the most users. For the case of states there’s really no point in plotting in detail, as the Federal District and the State of Mexico dwarf all other states, as can be seen in the following plots:

In the case of municipalities, we’ll focus on those in the Federal District:

For someone who lives in the Federal District, it’s obvious that there are more municipalities in this graph than there should be. The Federal District is divided into 16 regions (known in Spanish as Delegaciones), and this graph shows way more regions. Taking a cursory look at the regions displayed, it appears that some of them are part of the State of Mexico. Unfortunately, there are many of them obscuring those that we care about.

So it seems like we’ve hit a dead end here, as a proper analysis of users by state and municipality requires the data to be correct. One thing we could do is to ensure consistency between municipalities and states, but doing so requires an official listing of municipalities per state (a “gold standard”), as well as applying some fuzzy matching techniques to ensure all municipalities have a single unique name (as it stands right now, it’s quite possible that the same municipality was entered under slightly different names, due to spelling errors, data corruption, lack of accents, etc.).

Nevertheless, it’s likely that, even if states and municipalities are not matched up correctly in many cases, the rest of the data is fine (i.e. people did correctly write their municipality and borough of residence), so we can still plot a histogram including the municipalities and boroughs with the most registered users.

We’ll do just that by sorting the regions and making them an ordered factor before we plot (you can see the details in the Common Functions section above):
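The idea of ordering a factor by frequency can be sketched as follows. This is not the helper from the Common Functions section, just one plausible way to do it, applied to a made-up vector of boroughs.

```r
borough <- c("condesa", "roma norte", "centro",
             "roma norte", "condesa", "roma norte")  # made-up sample

# Count users per borough and sort from most to least frequent.
counts <- sort(table(borough), decreasing = TRUE)

# Use that order as the factor's level order, so that plots list the
# most popular places first.
borough_ordered <- factor(borough, levels = names(counts), ordered = TRUE)
levels(borough_ordered)
```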

Clearly, the distribution is really far from uniform, with just two or three municipalities leading the charts.

Let’s do the same plot for boroughs now:

We can also see if there are places, amongst those with most users, where the proportion between men and women deviates significantly from that of the general population:

In general, men seem to be the majority, but never exceeding 65%. Let’s see whether we can find a place where the proportion of men is higher by including more boroughs in our plot:

Here we notice for the first time a significant variation: the borough “Centro (Area 1)” (part of Downtown Mexico City) does have a proportion of men close to 69%, the biggest thus far. This is somewhat surprising, as intuitively one would expect that region to have a more balanced proportion.


Bivariate Analysis and Plots

Having become acquainted with the individual variables of the dataset, we’ll now proceed to look at the relationships between continuous variables in it.

For some of the plots involving the whole dataset, as opposed to summaries, we’ll be using a sample of 30,000 rows to make our plotting faster.
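Drawing such a sample can be sketched like this. The seed value and the placeholder data frame are arbitrary; only the sample size comes from the text.

```r
set.seed(2013)  # arbitrary seed, only so the sample is reproducible

# Placeholder with one row per user (110896 users after cleaning).
users <- data.frame(user.id = seq_len(110896))

# Pick 30,000 distinct row indices at random and keep those rows.
users_sample <- users[sample(nrow(users), 30000), , drop = FALSE]
nrow(users_sample)
```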

Age vs Tenure

We’ll begin by looking at the relationship between age and tenure. We wonder who tends to stick to the program longer, young people or older people:

Well, other than showing the grouping of people in three large clusters of tenure, the graph doesn’t really tell us much more. Let’s add a bit of jitter and some alpha to the points to see if we uncover anything about the distribution of points:

That’s just slightly better, but it doesn’t give us new information. All we see is things that we already knew from the univariate analysis (e.g. most of the population is young and middle-aged adults).

There’s just too much data on the plot, so let’s try using a summary instead. We’ll add both the median and mean to the graph so we can compare, but we’ll use the median as a more representative measure of the population as a whole because it’s a more robust statistic in the presence of outliers:
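To illustrate why the median is the more robust of the two, here is a small sketch that computes both per age group on made-up data. The data frame and its values are illustrative only.

```r
users <- data.frame(
  age    = c(25, 25, 25, 40, 40, 40),
  tenure = c(100, 200, 900, 300, 400, 500)  # 900 is an outlier at age 25
)

# One row per age, with the median and mean tenure of that age group.
tenure_by_age <- data.frame(
  age    = sort(unique(users$age)),
  median = as.vector(tapply(users$tenure, users$age, median)),
  mean   = as.vector(tapply(users$tenure, users$age, mean))
)
tenure_by_age
```

For age 25, the single outlier pulls the mean up to 400 while the median stays at 200; for age 40, where there is no outlier, the two agree.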

Now, this is much better. We can see a trend more clearly. Let’s add a smoother to make it even more obvious:

The story that this plot tells us is that the early adopters of Ecobici were seniors. The rest of the population came on board later on, thus having less tenure. Also, the bowl-shaped curve of the regression line tells us that there’s been a wave of young adults (around 30 years old) adopting the program more recently.

And now let’s facet it by gender and registration kind to see any possible variations in the distributions of those subpopulations:

Now, this is a very interesting plot, because it tells very different stories for people who registered online versus those who did it using Telmex or in the traditional way. For example, for the web subgroup, it appears as though more elderly women starting at age 63 have continued to join the program after the first year of operation, in contrast to elderly males who have stopped joining in more recent years.

However, it’s very important to keep in mind the smaller relative sizes of the web and Telmex populations when trying to interpret the plot. The plot itself gives us a warning of this as the variances are rather large in each case, thus decreasing our confidence that the story it seems to tell can generalize well.


Age vs Rides Count

Our next exploration will be getting at one of the initial questions: does age play a role in the adoption and active use of Ecobici?

During our analysis of the distribution of the age variable, we partially answered this question. However, it’s important to assess how age relates to the activity a user shows in the Ecobici program.

Let’s begin with a simple and straightforward scatterplot:

Let’s try to see how the population is distributed by adding an alpha parameter:

The bulk of the data seems to be below 300 rides, so in order to get a better view, let’s scale the vertical axis, change the plotting color to something light, and add two-dimensional contour lines to see where most points concentrate:

Now we can see what we discovered in the univariate analysis regarding the large number of users with just a few or zero rides. We can also see how the maximum number of points concentrates at around age 28 with a number of rides that varies between 10 and 100.

It would be nice if we could incorporate information about tenure in this same plot. One way to do it is to divide tenure into a discrete number of intervals and color code the points according to that.

Let’s begin by creating a new variable tenure_in_semesters:
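One possible way to define it is with cut(), assuming a semester is roughly 182.5 days. The exact definition used for the plots is not shown here, so the bin width and the labels S1 to S8 are assumptions; the tenure values come from the summary above.

```r
tenure <- c(0, 191, 386, 1015, 1415)  # min, quartiles and max of tenure

# Eight bins of about half a year each cover the maximum tenure (1415 days).
tenure_in_semesters <- cut(tenure,
                           breaks = seq(0, 8) * 182.5,
                           labels = paste0("S", 1:8),
                           include.lowest = TRUE)  # so that tenure 0 is kept
tenure_in_semesters
```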

And we’ll also facet by gender to include even more information in the plot:

Now we have even more information in one plot: we can easily see that most of the users with zero rides also have large tenures, which is interesting but still hard to understand. Was there a massive signup of users in the beginning of the program as part of a large media event, maybe?

A couple of other things are revealed by this plot:

  • New users tend to have few rides in their history, as expected
  • A majority of users joined the program in the last one or two years
  • Teenagers and other young people have joined the program very recently
  • Users who joined the program recently appear to be more active than those who joined earlier. We cannot tell for sure whether this is true with just this information, but there’s an easy way to confirm or deny this, as we’ll see in a moment.

If we really want to understand how active users are, we need to plot not the raw count of rides, but instead something that reveals their rate of activity, such as the average number of rides per week that they’ve taken. Let’s do that now:
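The rate itself is simple arithmetic: rides divided by tenure expressed in weeks. In this sketch the data are made up, the column name rides.per.week is hypothetical, and dropping users with zero tenure (to avoid dividing by zero) is our assumption about how a valid_users subset might be built.

```r
users <- data.frame(
  rides.count = c(0, 52, 20, 3),
  tenure      = c(1414, 364, 70, 0)  # days in the program
)

# Users who joined on the reference date have zero tenure, so drop them.
valid_users <- users[users$tenure > 0, ]

# Rides per week = rides / (days in the program / 7).
valid_users$rides.per.week <- valid_users$rides.count / (valid_users$tenure / 7)
valid_users$rides.per.week
```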

With this last plot, we can confirm that users who joined in the last year or two tend to be more active on average than those who joined earlier.


We’ve discovered quite a lot of things so far using scatterplots. Let’s now turn our attention to creating plots that use summary statistics on the number of rides per age group rather than the raw counts:

Here there’s a clear trend: young people between the ages of 15 and 25 are the ones who take the most rides, and there’s a steady decline in the number of rides as people get older. There are, of course, outliers that drag the mean upwards, but we care only about the typical user for the most part.

As we’ve done before, we’ll check what the distributions look like for different segments of our population, but the same caveats mentioned in the last section apply here regarding the remarkable differences in the distributions for Telmex and web registration types:

One thing worth noticing, though, is the difference in the distributions between men and women in the normal subpopulation. Notice how the trends seem to reverse between the ages of 70 and 80, going down in the case of men and going up significantly in the case of women. Could this be due to the presence of outliers in a relatively small group? Let’s keep in mind that the people in that age range make up just 0.39% of the total.


4. Final Plots and Summary

Our analysis of the Ecobici users has helped us answer all of the questions we had in mind before beginning our exploration, but it has also revealed some unexpected findings. In the following sections, we’ll recapitulate the analysis done and present three plots that gave us important insights about the data.

Who uses Ecobici more: men or women?

While it was clear from the beginning that men used Ecobici more than women, the extent to which this was true wasn’t so clear. During the analysis, we found out that there was a downward trend in the proportion of women that completed an ever increasing number of rides, up to the point where men were the only ones to complete more than a thousand rides.

Also interesting in this plot is the fact that, if we divide the population by the type of signup process they used to enroll in the program, the distributions of each group do change substantially.

We take this result with a grain of salt, however, given that the sizes of the Telmex and web groups are orders of magnitude smaller than that of the “normal” group, so it’s likely that they will have much higher variance and may not faithfully reflect the true underlying distribution (assuming that there is indeed such an out-of-sample distribution which is substantially different from the “normal” distribution).

In this plot there’s also a hint of another phenomenon that we’ll see more clearly in the following section, namely that there’s a large proportion of people who sign up for the program but end up taking just a few rides (or none at all).

Let’s also take a look at the statistics on the number of rides by gender. Let’s run the by command passing the summary and sd (standard deviation) functions to get these results.

First the summary function:

## users$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    3.00   25.38   26.00  944.00 
## -------------------------------------------------------- 
## users$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    8.00   40.81   48.00 1254.00

And then the sd function:

## users$gender: female
## [1] 51.97177
## -------------------------------------------------------- 
## users$gender: male
## [1] 73.78403

Clearly, the number of rides completed by the median person is extremely low, especially in the case of women. The lower standard deviation for women also indicates that their ride counts are less spread out than those of men. We’ll see this result graphically in the following plot as well.

One might be quick to attribute this to the relatively large number of newcomers to the program, but sadly an even worse result comes up when we filter out people who joined in the last six months (using December 31, 2013 as the reference date):

## older_users$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00   24.83   22.00  944.00 
## -------------------------------------------------------- 
## older_users$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    4.00   40.27   43.00 1254.00
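A minimal sketch of this kind of filter follows. The column name date.registered and the exact cutoff of June 30, 2013 are assumptions made for illustration (the report only says “the last six months”), and the data frame is a made-up stand-in.

```r
# Toy stand-in; date.registered is a hypothetical column name
users <- data.frame(
  gender          = factor(c("female", "male", "female", "male")),
  rides.count     = c(1, 4, 0, 12),
  date.registered = as.Date(c("2011-03-10", "2012-08-01",
                              "2013-09-15", "2013-11-30"))
)

# Assumed cutoff: six months before the reference date of 2013-12-31
cutoff <- as.Date("2013-06-30")

# Keep only users who registered on or before the cutoff
older_users <- subset(users, date.registered <= cutoff)

# Same per-gender summary as before, now on the filtered users
by(older_users$rides.count, older_users$gender, summary)
```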

This is a strong indication that, despite the large number of people who have joined, the program didn’t really take off, and people just didn’t use it quite as much as we might have expected.

It’s only the most enthusiastic users who drag the mean upwards to about 25 rides in the case of women and to about 41 for men. This is especially true in the case of women: once newcomers are filtered out, their third quartile (22 rides) is smaller than their mean (24.83 rides).

Another thing we might be interested in seeing, rather than just the raw counts, is the average number of rides people have taken per week during the time they’ve been in the program. This is not hard to compute, as we have the tenure available:

## valid_users$gender: female
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.09013  0.87690  0.80770 38.61000 
## -------------------------------------------------------- 
## valid_users$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2121  1.4160  1.5620 66.1000
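One way such a rides-per-week figure could be computed is sketched below. The report does not show how tenure is stored or how valid_users was defined, so the columns date.registered, tenure.weeks and rides.per.week, as well as the rule of dropping users with zero tenure, are assumptions made for this example.

```r
# Toy stand-in; all column names except gender and rides.count are hypothetical
users <- data.frame(
  gender          = factor(c("female", "male", "female", "male")),
  rides.count     = c(26, 104, 0, 10),
  date.registered = as.Date(c("2013-07-02", "2012-01-03",
                              "2013-12-31", "2013-12-17"))
)

reference_date <- as.Date("2013-12-31")

# Tenure in weeks: days between registration and the reference date, over 7
users$tenure.weeks <- as.numeric(reference_date - users$date.registered) / 7

# Assumed validity rule: drop zero-tenure users to avoid dividing by zero
valid_users <- subset(users, tenure.weeks > 0)
valid_users$rides.per.week <- valid_users$rides.count / valid_users$tenure.weeks

by(valid_users$rides.per.week, valid_users$gender, summary)
```

Normalizing by tenure makes newcomers and long-standing members comparable, which raw ride counts do not.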

The result shows that, viewed as a whole, men use Ecobici more per week than women: about 1.6 times as much by the mean (1.42 vs. 0.88 rides per week) and more than twice as much by the median (0.21 vs. 0.09). Even so, the average number of rides per week doesn’t exceed two for either gender, which is sad news for the program, of course.

Active users of Ecobici

The Ecobici dataset contains a field called status which tells us whether a given user is currently paying their annual fee, but that is obviously not enough to tell whether they actually use the program or not and to what extent.

So, to find this out, we analyzed the number of rides people have taken against their age, the number of years they’ve been enrolled in the program, as well as their gender, in hopes of finding patterns:

This plot reveals several interesting things:

  • Newcomers tend to have few rides in their history, as expected

  • The largest age groups in women seem to be confined to a narrower range than those of men (notice how the two-dimensional density lines are more tightly packed and further to the left). In other words, the female population tends to be younger than the male population, and less spread out in age.

  • A majority of users joined the program in the last one or two years

  • Most teenagers and other young people have joined the program very recently

  • Users who joined the program recently appear to be more active than those who joined earlier. We confirmed this during our analysis by plotting not against the raw number of rides for each user, but rather against the average number of rides per week.

  • The horizontal stripes at the bottom of the plot show that there is a large number of people who signed up for the program, took a few rides and then stopped using it. There are even people who didn’t use the service at all. In this latter case, they appear to be mostly people who signed up at the beginning of the program back in 2010 (notice the dark blue stripe at the bottom).

From the graph, it’s not very clear whether there is a correlation between age and the number of rides a person takes, but we can run an analytical test to get this information:

## 
##  Pearson's product-moment correlation
## 
## data:  users$age and users$rides.count
## t = -22.548, df = 110890, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07341162 -0.06169413
## sample estimates:
##        cor 
## -0.0675552
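For reference, a test like this one can be run with cor.test, which performs Pearson’s product-moment correlation by default. The sketch below uses synthetic data in place of the real users data frame, so its output will not match the figures above; only the column names age and rides.count come from the report.

```r
# Synthetic stand-in for the real data (values are random, for illustration)
set.seed(1)
n <- 200
users <- data.frame(age = sample(16:80, n, replace = TRUE))
users$rides.count <- rpois(n, lambda = 40)

# Pearson correlation test between age and number of rides
result <- cor.test(users$age, users$rides.count)
print(result)

# Individual pieces of the result can be pulled out by name
result$estimate   # the correlation coefficient
result$conf.int   # its 95 percent confidence interval
```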

The result is negative and quite small, so it’s hard to make any conjectures. But this is understandable: after all, there is quite a lot of variation in each age group (this is also quite evident from the shape of the plot). So what if we took the average (and median) number of rides per age and then ran the test again? Let’s find out for the average first:

## 
##  Pearson's product-moment correlation
## 
## data:  age and rides.count.mean
## t = -3.4549, df = 69, p-value = 0.0009456
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5665722 -0.1655609
## sample estimates:
##       cor 
## -0.384031

And then for the median:

## 
##  Pearson's product-moment correlation
## 
## data:  age and rides.count.median
## t = -8.2672, df = 69, p-value = 6.488e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8060657 -0.5651258
## sample estimates:
##        cor 
## -0.7054224
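The per-age aggregation behind these two tests could look roughly like the sketch below, which collapses the data to one mean and one median per age with tapply before calling cor.test. The variable names age, rides.count.mean and rides.count.median mirror the output above; the data is synthetic, with a negative trend deliberately built in, so the numbers will differ from the report’s.

```r
# Synthetic stand-in: ride counts are generated to decrease with age
set.seed(1)
n <- 500
users <- data.frame(age = sample(16:80, n, replace = TRUE))
users$rides.count <- rpois(n, lambda = 60 - users$age / 2)

# One value per distinct age; tapply orders groups by increasing age
rides.count.mean   <- as.numeric(tapply(users$rides.count, users$age, mean))
rides.count.median <- as.numeric(tapply(users$rides.count, users$age, median))
age <- sort(unique(users$age))

# Correlation of age against the per-age mean, then the per-age median
mean_test   <- cor.test(age, rides.count.mean)
median_test <- cor.test(age, rides.count.median)
print(mean_test)
print(median_test)
```

Note that after aggregation each age counts as a single observation, so these tests describe how the typical user of each age behaves rather than how individual users vary.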

This seems like a more significant result. There seems to be a strong negative correlation between age and the median number of rides per age, meaning that the older you are, the fewer rides you are likely to take in Ecobici (even if you did have good intentions and joined the program). We saw this during our exploratory analysis too, but the numbers here confirm the result analytically. This result aligns well with our intuition about the relationship between the two variables.

5. Reflection

We started our exploration with some initial questions in mind, but we also knew we might end up exploring other things as we dug deeper into the relationships hidden in the data.

The scatterplot matrix was a useful tool for getting a quick overview of what lay ahead, as it gave us some direction about what might be interesting to look at.

Our analysis then began with an attempt to understand each variable in isolation: histograms, augmented with lines showing the mean, median and interquartile range, were of great help here. Boxplots gave us additional information about what was considered an outlier in each distribution. We also made use of faceted views and coloring to visualize single variables for different subgroups (e.g. males vs. females, people who signed up offline vs. on the web).

Our analysis then proceeded to consider the relationship between pairs of continuous variables, and we realized that sometimes a direct plot of one variable against another doesn’t really provide much insight into the data. That’s when we turned to using summary statistics. We tried using both the mean and the median, but we settled on the median when trying to make conjectures because our dataset contained too much variation and too many outliers, making the mean a poor choice for understanding the typical member of a population.

Difficulties

One of the initial difficulties we ran into during the analysis was that the data wasn’t completely tidy. We realized this as we were doing the visualizations and had to step back and apply some simple cleaning steps before proceeding, but we were aware that more work would be needed for a completely clean dataset.

Another difficulty we had was interpreting some of the data correctly. Specifically, in the case of the subpopulations defined by the kind of registration, we were initially inclined to believe that the differences we saw in their distributions and statistical summaries might be an indication of a truly different kind of population. But, as we progressed through the analysis, we came to realize that we should be very careful not to jump to conclusions because those subpopulations were very small in size and therefore very prone to high variance.

Successes

We were able to answer all of the initial questions we had in mind, and we were also lucky to make some important discoveries regarding the adoption of the Ecobici program that may be very valuable for improving it.

One of those discoveries is the unfortunate pattern that a considerable proportion of the people who sign up for the program end up using it just a couple of times, something that happened even more frequently in the case of women.

Another discovery we made was that, despite its growth stalling during its second year of operation, the program made a comeback in its third and fourth years that attracted a lot of young people who are very active in terms of rides per week.

Future Work

One piece of future work that would be fantastic is to create a heatmap of Mexico City showing the levels of adoption by region (i.e. municipality). This requires, of course, cleaning up the data more thoroughly, and getting a shapefile of all the municipalities that comprise Mexico City. With that geospatial data, we could easily visualize other things too, such as user adoption faceted by gender, by “activity status” (as measured by the number of rides in a year), etc.

Other possibilities for future analysis include merging other datasets (or their summaries) with this one to find even more interesting facts about the users of Ecobici. For instance, the website where we obtained the dataset used in this analysis also makes available datasets regarding Ecobici stations and all rides taken since its launch in 2010. This could help us answer questions such as: what’s the average time that people of different ages exercise using Ecobici? What are some of the most/least commonly used stations at certain times during the day? Where should we place the next Ecobici station so that it benefits active and devoted users the most?